Abstract

It is shown that the metric representation of DNA sequences is one-to-one. Using the
metric representation method, the suppression of nucleotide strings in DNA sequences is
determined. For a DNA sequence, an optimal string length for displaying the genomic
signature in the chaos game representation is obtained by eliminating the effects of
finite sequence length. The optimal string length is further shown to be a self-similarity
limit in computing the information dimension. Using this method, the self-similarity
limits of complete bacterial genomic signatures are determined.

I. INTRODUCTION 

With the increasing number of DNA sequences obtained from experiments, it is important
to develop methods for extracting meaningful information from these one-dimensional
symbolic sequences composed of the four letters 'A', 'C', 'G' and 'T' (or 'U'). To detect
similarity in DNA sequences, scatter plots [1] were introduced to classify cytochromes
and to illustrate a dendrogram. From a comparison of a pair of duplicated genes by a
distance matrix, the evolutionary relationship of the three primary kingdoms of life was
inferred [?]. By investigating the relative abundances of short oligonucleotides in
subsequences, the genomic signature phenomenon and the derivation of partial-ordering
relationships among bacterial genomes were proposed. The genomic signature states that
the differences in dinucleotide relative abundance values within a single genome are
smaller than those between distinct genomes. The chaos game representation (CGR) [?],
which maps a one-dimensional sequence onto a two-dimensional square, provides a technique
for visualizing the composition of DNA sequences. By combining the CGR and short-sequence
representation methods, the evolution of species-type specificity in mitochondrial genomes
was analyzed [?]. In terms of the CGR method, it was shown that the main characteristics
of the whole genome are exhibited by its subsequences [?]. The genomic signature was
extended to describe characteristics of CGR images. By defining a Euclidean metric between
two CGR images, the classification of species in the three primary kingdoms was discussed.

Recently, the metric representation (MR), which is borrowed from symbolic dynamics, was
used to order subsequences in a plane. The MR method is an extension of the CGR.
Suppression of certain nucleotide strings in DNA sequences leads to a self-similar
pattern in the MR of DNA sequences. In this paper, we first show that the MR is
one-to-one. Using the MR method, we determine the suppression of nucleotide strings in
DNA sequences. Then, by eliminating the effects of finite sequence length on the
suppression of nucleotide strings, we give an optimal string length for displaying the
genomic signature. Moreover, we plot the information function versus string length to
determine self-similarity limits in MR images. Using this method, we present the
self-similarity limits of complete bacterial genomic signatures.



II. SUPPRESSION OF NUCLEOTIDE STRINGS 

For a given DNA sequence, we have a one-dimensional symbolic sequence
$s_1 s_2 \cdots s_i \cdots s_N$ ($s_i \in \{A, C, G, T\}$). In a two-dimensional MR, we map
each symbol to the numbers $\mu_i, \nu_i \in \{0, 1\}$ and calculate the values
$(\alpha, \beta)$ of all subsequences $\Sigma_m = s_1 s_2 \cdots s_m$ ($1 \le m \le N$).
The number $\alpha$, represented in base 3 and lying between 0 and 1, is defined as

$$\alpha = 2\sum_{j=1}^{m} \mu_{m-j+1}\,3^{-j} + 3^{-m} = 2\sum_{i=1}^{m} \mu_i\,3^{-(m-i+1)} + 3^{-m}, \tag{1}$$

where $\mu_i$ is 0 if $s_i \in \{A, C\}$ or 1 if $s_i \in \{G, T\}$. Similarly, the number
$\beta$ is defined as

$$\beta = 2\sum_{j=1}^{m} \nu_{m-j+1}\,3^{-j} + 3^{-m} = 2\sum_{i=1}^{m} \nu_i\,3^{-(m-i+1)} + 3^{-m}, \tag{2}$$

where $\nu_i$ is 0 if $s_i \in \{A, T\}$ or 1 if $s_i \in \{C, G\}$. According to (1) and
(2), the one-dimensional symbolic sequence $s_1 s_2 \cdots s_N$ is partitioned into 4 kinds
of subsequences, which correspond to points in the 4 fundamental zones A, C, G and T of
Fig. 1. Under left or right shift operators, each zone can be further shrunk into smaller
zones with an area factor of $1/3^2$. For an infinite sequence, this procedure defines a
fractal [?], which is self-similar. All subsequences with the same ending $k$-nucleotide
string correspond to points in the zone encoded by that $k$-nucleotide string.
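The definitions (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names `metric_representation` and `_coordinate` are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here so that zone boundaries are not blurred by rounding.

```python
from fractions import Fraction

# Symbol-to-number correspondence stated in the text:
# mu = 0 for A, C and 1 for G, T; nu = 0 for A, T and 1 for C, G.
MU = {"A": 0, "C": 0, "G": 1, "T": 1}
NU = {"A": 0, "T": 0, "C": 1, "G": 1}


def _coordinate(seq, table):
    """One coordinate of Eqs. (1)-(2): 2 * sum_i x_i 3^-(m-i+1) + 3^-m."""
    m = len(seq)
    total = Fraction(1, 3**m)
    for i, symbol in enumerate(seq, start=1):
        total += 2 * Fraction(table[symbol], 3 ** (m - i + 1))
    return total


def metric_representation(seq):
    """Return (alpha, beta) for the subsequence seq = s_1 ... s_m."""
    return _coordinate(seq, MU), _coordinate(seq, NU)


if __name__ == "__main__":
    for s in ("A", "C", "G", "T", "AC"):
        print(s, metric_representation(s))
```

The last symbol carries the largest weight (2/3), so it alone decides which of the four fundamental zones the point falls in.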

Lemma 1. $(\alpha, \beta)\{S(\Sigma_m)\} = 2(\mu_{m+1}, \nu_{m+1})/3 + (\alpha, \beta)\{\Sigma_m\}/3$,
where $S$ is the left shift operator.

PROOF: For the left shift operator, $S(\Sigma_m) = \Sigma_m s_{m+1}$. The result follows
immediately from the definitions (1) and (2).
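For completeness, the step left implicit in the proof can be written out for the $\alpha$ coordinate; the $\beta$ coordinate follows in the same way with $\nu$ in place of $\mu$. Splitting off the $i = m+1$ term of definition (1) applied to $\Sigma_{m+1} = S(\Sigma_m)$ gives:

```latex
\begin{align*}
\alpha\{\Sigma_{m+1}\}
  &= 2\sum_{i=1}^{m+1} \mu_i\,3^{-(m+1-i+1)} + 3^{-(m+1)} \\
  &= \frac{2\mu_{m+1}}{3}
     + \frac{1}{3}\Bigl[\,2\sum_{i=1}^{m} \mu_i\,3^{-(m-i+1)} + 3^{-m}\Bigr] \\
  &= \frac{2\mu_{m+1}}{3} + \frac{\alpha\{\Sigma_m\}}{3}.
\end{align*}
```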

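Lemma 2. $(\alpha, \beta)\{\Sigma_m\} = (\alpha, \beta)\{G^\infty \Sigma_m\}$.

PROOF: When $m = 1$, $\Sigma_1 = s_1$ and $G^\infty \Sigma_1 = S(G^\infty)$. By Lemma 1, we
obtain $(\alpha, \beta)\{G^\infty \Sigma_1\} = 2(\mu_1, \nu_1)/3 + (\alpha, \beta)\{G^\infty\}/3
= 2(\mu_1, \nu_1)/3 + (1, 1)/3 = (\alpha, \beta)\{\Sigma_1\}$.

Suppose that for $m = i$ we have $(\alpha, \beta)\{\Sigma_i\} = (\alpha, \beta)\{G^\infty \Sigma_i\}$.
For $m = i + 1$, we have $\Sigma_{i+1} = \Sigma_i s_{i+1} = S(\Sigma_i)$ and
$G^\infty \Sigma_{i+1} = S(G^\infty \Sigma_i)$. By Lemma 1, we obtain
$(\alpha, \beta)\{\Sigma_{i+1}\} = 2(\mu_{i+1}, \nu_{i+1})/3 + (\alpha, \beta)\{\Sigma_i\}/3$ and
$(\alpha, \beta)\{G^\infty \Sigma_{i+1}\} = 2(\mu_{i+1}, \nu_{i+1})/3 + (\alpha, \beta)\{G^\infty \Sigma_i\}/3$.
So, using the supposition $(\alpha, \beta)\{\Sigma_i\} = (\alpha, \beta)\{G^\infty \Sigma_i\}$, we
arrive at $(\alpha, \beta)\{\Sigma_{i+1}\} = (\alpha, \beta)\{G^\infty \Sigma_{i+1}\}$.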
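Lemma 2 can also be checked directly, which shows where the term $3^{-m}$ in (1) comes from. The sketch below assumes that definition (1) extends to an infinite sequence by letting the sum over $j$ run to infinity, which is consistent with the value $(\alpha, \beta)\{G^\infty\} = (1, 1)$ used in the proof. Since $\mu = 1$ for every G in the prefix, the prefix contributes a geometric series:

```latex
\begin{align*}
\alpha\{G^\infty \Sigma_m\}
  &= 2\sum_{j=1}^{m} \mu_{m-j+1}\,3^{-j} + 2\sum_{j=m+1}^{\infty} 3^{-j} \\
  &= 2\sum_{j=1}^{m} \mu_{m-j+1}\,3^{-j} + 2\cdot\frac{3^{-(m+1)}}{1 - 1/3} \\
  &= 2\sum_{j=1}^{m} \mu_{m-j+1}\,3^{-j} + 3^{-m}
   = \alpha\{\Sigma_m\}.
\end{align*}
```

The same computation with $\nu = 1$ for G gives the $\beta$ coordinate.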

By Lemma 2, each finite subsequence $\Sigma_m$ has a corresponding infinite sequence
$G^\infty \Sigma_m$. Here, we define the set of these infinite sequences as $\Sigma$.

Theorem 1. The map $\Sigma \to A$ is one-to-one, where $A$ is the set of points in the
$(\alpha, \beta)$ plane.

This means that, given two infinite sequences $\Sigma_1, \Sigma_2 \in \Sigma$, if
$\Sigma_1 \neq \Sigma_2$, then $(\alpha, \beta)\{\Sigma_1\} \neq (\alpha, \beta)\{\Sigma_2\}$. We
give a proof by contradiction. Suppose $\Sigma_1 \neq \Sigma_2$ but
$(\alpha, \beta)\{\Sigma_1\} = (\alpha, \beta)\{\Sigma_2\}$, and mark this common point as P in
the $(\alpha, \beta)$ plane. The zone containing the point P is encoded by a single
mononucleotide, so $\Sigma_1$ and $\Sigma_2$ end with the same mononucleotide. Enlarging the
zone by an area factor of $3^2$, we obtain two encoding subsequences that end with the
same dinucleotide. Each enlarging step applies a right shift to the two subsequences. At
the same time, the point P is included in only one of the four enlarged zones, so the two
shifted subsequences again end with the same symbol. Repeating the enlarging process
indefinitely, we obtain $\Sigma_1 = \Sigma_2$, contradicting our original assumption. This
contradiction arises because we assumed
$(\alpha, \beta)\{\Sigma_1\} = (\alpha, \beta)\{\Sigma_2\}$; thus, if $\Sigma_1 \neq \Sigma_2$,
then $(\alpha, \beta)\{\Sigma_1\} \neq (\alpha, \beta)\{\Sigma_2\}$.
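The enlarging argument in the proof is constructive: reading off the zone of a point and enlarging it recovers the symbols one by one. A minimal sketch follows, in which `encode` and `decode` are hypothetical names and `encode` builds the point with the recursion of Lemma 1 starting from $(1, 1)$, the point of $G^\infty$. The length `m` is passed explicitly because, by Lemma 2, leading G's of a finite string cannot be told apart from the $G^\infty$ prefix.

```python
from fractions import Fraction

# Zone labels keyed by (mu, nu), following the correspondence in the text.
SYMBOL = {(0, 0): "A", (0, 1): "C", (1, 1): "G", (1, 0): "T"}
MU_NU = {s: mn for mn, s in SYMBOL.items()}


def encode(seq):
    """MR point of G^inf followed by seq, built with one left shift per symbol."""
    alpha = beta = Fraction(1)  # point of the all-G sequence
    for symbol in seq:
        mu, nu = MU_NU[symbol]
        alpha = (2 * mu + alpha) / 3
        beta = (2 * nu + beta) / 3
    return alpha, beta


def decode(alpha, beta, m):
    """Recover the last m symbols by repeatedly enlarging the zone around the point."""
    symbols = []
    for _ in range(m):
        mu = 1 if alpha > Fraction(1, 2) else 0
        nu = 1 if beta > Fraction(1, 2) else 0
        symbols.append(SYMBOL[(mu, nu)])
        alpha = 3 * alpha - 2 * mu  # right shift: undo one step of Lemma 1
        beta = 3 * beta - 2 * nu
    return "".join(reversed(symbols))


if __name__ == "__main__":
    point = encode("GATTACA")
    print(point, decode(*point, 7))
```

Because every step has exactly one admissible zone, two different infinite sequences cannot produce the same point, which is the content of Theorem 1.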

For a DNA sequence, some zones in the CGR are filled with points, so that a pattern
appears. In the CGR, several subsequences with different ending $k$-nucleotide strings can
correspond to the same point on the boundaries of zones. For example, the subsequences
$G^\infty A$ in zone A, $T^\infty C$ in zone C, $A^\infty G$ in zone G and $C^\infty T$ in zone T
share the CGR point (1/2, 1/2). Under left shift operators, this correspondence between
points and subsequences is preserved in smaller zones. For example, the subsequences
$G^\infty AC$ in zone AC, $T^\infty C^2$ in zone $C^2$, $A^\infty GC$ in zone GC and $C^\infty TC$
in zone TC share the CGR point (1/4, 3/4). In the MR of DNA sequences, each zone of the
CGR is shrunk, and the zones are clearly separated by four bands. There is then a
one-to-one correspondence between zones and ending $k$-nucleotide strings of subsequences.
The frequency of points in a zone can be determined with the MR method as follows. To
compute frequencies in zones encoded by $k$-nucleotide strings, we need to determine the
partition lines of the MR in Fig. 1. For mononucleotides, there exist $2 \times 2$ zones in
the MR. We have $n_1 (= 3)$ partition lines $b^1_0 = 0$, $b^1_1 = 1/2$ and $b^1_2 = 1$ along
the $\alpha$ axis. For dinucleotides, there exist $4 \times 4$ zones in the MR. We have
$n_2 (= 5)$ partition lines $b^2_0 = b^1_0 = 0$, $b^2_1 = b^1_1/3 = 1/6$, $b^2_2 = b^1_1 = 1/2$,
$b^2_3 = 1 - b^2_1 = 5/6$ and $b^2_4 = 1 - b^2_0 = 1$ along the $\alpha$ axis. In general,
given the $n_{k-1} (= 2^{k-1} + 1)$ partition lines $b^{k-1}_i$ ($i = 0, 1, \ldots, n_{k-1} - 1$)
for $(k-1)$-nucleotide strings along the $\alpha$ axis, we can obtain the
$n_k (= 2^k + 1 = 2n_{k-1} - 1)$ partition lines $b^k_i$ ($i = 0, 1, \ldots, n_k - 1$) for
$k$-nucleotide strings as follows. For the $k$-nucleotide strings, there exist
$2^k \times 2^k$ zones in the MR. The left half ($0 \le i \le n_{k-1} - 1$) of the partition
lines along the $\alpha$ axis is given by
$$b^k_i = \begin{cases} b^{k-1}_{i/2}, & \text{if } i\%2 = 0, \\ b^{k-1}_{i}/3, & \text{if } i\%2 = 1. \end{cases} \tag{3}$$

From (3), the right half ($n_{k-1} \le i \le n_k - 1$) of the partition lines along the
$\alpha$ axis can be determined immediately:

$$b^k_i = 1 - b^k_{n_k - 1 - i}. \tag{4}$$

For example, for trinucleotides, the 9 partition lines along the $\alpha$ axis are 0, 1/18,
1/6, 5/18, 1/2, 13/18, 5/6, 17/18 and 1. For tetranucleotides, we obtain the 17 partition
lines 0, 1/54, 1/18, 5/54, 1/6, 13/54, 5/18, 17/54, 1/2, 37/54, 13/18, 41/54, 5/6, 49/54,
17/18, 53/54 and 1 along the $\alpha$ axis. The partition lines along the $\beta$ axis are the
same as those along the $\alpha$ axis. Each zone in the MR is thus bounded by the combined
partition lines along the $\alpha$ and $\beta$ axes.
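The recursion for the partition lines is short enough to sketch in code. The version below reads the rule off the mononucleotide and dinucleotide examples in the text: in the left half, even-indexed lines are inherited from the previous level and odd-indexed lines are previous-level lines divided by 3, and the right half is the mirror image about 1/2. The function name `partition_lines` is hypothetical.

```python
from fractions import Fraction


def partition_lines(k):
    """Partition lines along one axis for k-nucleotide strings (k >= 1)."""
    lines = [Fraction(0), Fraction(1, 2), Fraction(1)]  # mononucleotides
    for _ in range(2, k + 1):
        prev = lines
        n_prev = len(prev)       # n_{k-1} = 2^(k-1) + 1
        n_k = 2 * n_prev - 1     # n_k = 2^k + 1
        # Left half: even i inherits b^{k-1}_{i/2}, odd i takes b^{k-1}_i / 3.
        left = [prev[i // 2] if i % 2 == 0 else prev[i] / 3 for i in range(n_prev)]
        # Right half: mirror image, b^k_i = 1 - b^k_{n_k - 1 - i}.
        right = [1 - left[n_k - 1 - i] for i in range(n_prev, n_k)]
        lines = left + right
    return lines


if __name__ == "__main__":
    print([str(b) for b in partition_lines(3)])
    print([str(b) for b in partition_lines(4)])
```

Each partition line lies midway across the empty band between two neighbouring zones, so a point can be assigned to its zone by locating its coordinates between consecutive lines.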

Using the MR method, we determine the suppression of $k$-nucleotide strings in HUMHBB
(human $\beta$-region, chromosome 11) with 73308 bases and in YEAST1 (yeast chromosome 1)
with 230209 bases; the results are listed in Table I. To check the method, we also
counted directly, for each given string length, the strings that are absent among all
possible strings in HUMHBB and YEAST1. The results are identical to those in Table I.
So, the MR method is effective for determining the suppression of nucleotide strings in
DNA sequences.
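The direct cross-check mentioned above can be sketched as follows. This is not the MR procedure itself but the plain enumeration it is compared against; the function name `suppressed_count` is hypothetical, and windows containing symbols other than A, C, G, T are simply ignored, which is an assumption about how ambiguous bases would be handled.

```python
def suppressed_count(seq, k):
    """Number of k-nucleotide strings over {A, C, G, T} that never occur in seq."""
    alphabet = set("ACGT")
    # Collect every distinct k-string that appears as a window of the sequence.
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Keep only windows made of the four standard letters.
    seen = {word for word in seen if set(word) <= alphabet}
    return 4**k - len(seen)


if __name__ == "__main__":
    toy = "ACGTACGT"
    for k in (1, 2, 3):
        print(k, suppressed_count(toy, k))
```

Applied to a full sequence such as HUMHBB for k = 5, ..., 10, this enumeration would produce the kind of counts listed in Table I.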

In the CGR of DNA sequences, self-similarity patterns become more obscure as the
lengths of the sequences increase. A grey-scale plot shows the frequencies in small
zones, whose sizes ($2^{-k} \times 2^{-k}$) are set by the length $k$ of the strings encoding
the zones. As the string length increases, the self-similarity patterns in the CGR become
clearer. High- and low-frequency zones are subdivided into smaller zones and rendered on a
grey scale. Some empty zones may appear in the patterns of the CGR, i.e., some nucleotide
strings are suppressed in the sequences. As the zone size decreases, more and more empty
zones emerge in the patterns of the CGR. For example, the evolution of a self-similarity
pattern in the CGR of the archaebacterium Archaeoglobus fulgidus is shown in Fig. 1 of
Ref. [?]. If DNA sequences were infinite, the compositional structure could be displayed
in arbitrarily small zones, and empty zones would be part of the global feature of the
CGR. But DNA sequences are finite. A finite sequence, even a random one, may also lead to
suppression of strings. As the string length increases, more and more strings are
suppressed in finite sequences.

In Table I, we compare the suppression of nucleotide strings between DNA and random
sequences of the same length. Suppression of nucleotide strings in HUMHBB starts at
$k = 5$. For a random sequence of the same length, produced with a random number
generator [?], suppression of nucleotide strings is delayed and starts at $k = 7$. There,
the number of suppressed nucleotide strings for the random sequence is 5.67% of that for
HUMHBB. Thus the finite length of HUMHBB already affects the suppression of 7-nucleotide
strings. As $k$ increases, the number of suppressed nucleotide strings for the random
sequence grows and approaches that for HUMHBB. At $k = 10$, the number of suppressed
nucleotide strings for the random sequence is 99.3% of that for HUMHBB. In this case,
suppression of nucleotide strings in HUMHBB is mainly caused by the finite length of the
sequence. Likewise, suppression of nucleotide strings in YEAST1 starts at $k = 7$. For a
random sequence of the same length, produced with a random number generator [?],
suppression of nucleotide strings is delayed and starts at $k = 8$. There, the number of
suppressed nucleotide strings for the random sequence is 22.7% of that for YEAST1. Thus
the finite length of YEAST1 already affects the suppression of 8-nucleotide strings. At
$k = 10$, the number of suppressed nucleotide strings for the random sequence is 97.5% of
that for YEAST1. From this comparison, we conclude that HUMHBB and YEAST1 have shorter
suppressed nucleotide strings than random sequences of the same lengths. As the string
length increases, the finite sequence length has a stronger effect on the suppression of
nucleotide strings.
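The percentages quoted in this comparison follow directly from the counts in Table I. A quick arithmetic check, with the (DNA, random) pairs copied from the table and a hypothetical helper `random_share`:

```python
# Suppressed-string counts from Table I as (DNA, random) pairs, keyed by k.
HUMHBB = {7: (3667, 208), 10: (985222, 977852)}
YEAST1 = {8: (8897, 2021), 10: (863555, 842246)}


def random_share(dna, rnd):
    """Suppressed strings of the random sequence as a percentage of the DNA count."""
    return 100.0 * rnd / dna


if __name__ == "__main__":
    for name, table in (("HUMHBB", HUMHBB), ("YEAST1", YEAST1)):
        for k, (dna, rnd) in sorted(table.items()):
            print(f"{name} k={k}: {random_share(dna, rnd):.2f}%")
```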

To display the genomic signature, we must eliminate the effects of finite sequence length
on the suppression of nucleotide strings. For a DNA sequence, we take as the optimal
string length the longest string length at which no nucleotide string is yet suppressed
in a random sequence of the same length. According to this definition, string lengths 6
and 7 can be chosen as the optimal options for the genomic signatures of HUMHBB and
YEAST1, respectively.
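The onset of suppression in a random sequence can also be estimated without generating one. The sketch below is a substitute for the random-number-generator experiment in the text, not a reproduction of it: it assumes uniform, independent bases and treats the $N - k + 1$ windows as independent draws, so that a given $k$-string is absent with probability $(1 - 4^{-k})^{N-k+1}$. The names `expected_suppressed` and `optimal_string_length`, and the threshold of one expected absent string, are illustrative choices.

```python
import math


def expected_suppressed(n, k):
    """Expected number of absent k-strings in a uniform random sequence of n bases.

    Assumption: the n - k + 1 windows are independent, which ignores overlaps.
    """
    windows = n - k + 1
    return 4**k * math.exp(windows * math.log1p(-(4.0 ** -k)))


def optimal_string_length(n, threshold=1.0):
    """Longest k before the random sequence is expected to lose a k-string."""
    k = 1
    while expected_suppressed(n, k + 1) < threshold:
        k += 1
    return k


if __name__ == "__main__":
    for name, n in (("HUMHBB", 73308), ("YEAST1", 230209)):
        print(name, optimal_string_length(n),
              [round(expected_suppressed(n, k)) for k in range(6, 11)])
```

Under these assumptions the estimate gives string lengths 6 and 7 for sequences of 73308 and 230209 bases, in line with the choices above, and the estimated counts for the random sequences are of the same order as the random-sequence entries of Table I (for example roughly 21,400 at k = 8 for 73308 bases, against 21402 in the table).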



III. LIMITS OF SELF-SIMILARITY SCALES 

Suppression of certain nucleotide strings in DNA sequences leads to a fractal pattern in
the MR of DNA sequences. To quantify this fractal feature, we introduce the information
dimension. For a given length $k$ of nucleotide strings, we have $M (= N - k + 1)$
subsequences $\Sigma_m$ ($m = k, k + 1, \ldots, N$), which end with $M$ $k$-nucleotide strings.
These subsequences correspond to $M$ points in the MR. In the MR, the length of a zone and
the total number of zones are $\epsilon = 3^{-k}$ and $Z = 4^k$, respectively. The number of
points falling in the $i$-th zone and the number of non-empty zones are denoted by
$m_i(\epsilon)$ and $Z(\epsilon)$, respectively. Dividing $m_i(\epsilon)$ by the total number
of points $M$ yields a probability $P_i(\epsilon)$ for the $i$-th zone. The information
function and the information dimension for the points in the MR are respectively defined
as [10]

$$I(\epsilon) = -\sum_{i=1}^{Z(\epsilon)} P_i(\epsilon) \log P_i(\epsilon), \tag{5}$$

and

$$D_I = \lim_{\epsilon \to 0} \frac{I(\epsilon)}{\log(1/\epsilon)}. \tag{6}$$
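Because each zone corresponds to one ending $k$-nucleotide string, the occupation numbers $m_i(\epsilon)$ are just the counts of the $k$-strings in the sequence, and the information function can be sketched without constructing the MR points explicitly. The function name `information_function` is hypothetical, and natural logarithms are assumed, in line with the value log 3 = 1.10 used in the text.

```python
import math
from collections import Counter


def information_function(seq, k):
    """I(eps) at eps = 3**-k, computed from the counts of ending k-strings."""
    # One point per subsequence Sigma_m, m = k..N, labelled by its ending k-string.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())  # M = N - k + 1
    # Sum over non-empty zones only; empty zones contribute nothing.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    toy = "ACGTACGTTTGACA"
    for k in (1, 2, 3):
        print(k, k * math.log(3), information_function(toy, k))
```

Plotting the second printed column against the first for a real sequence gives the kind of curve from which the scaling region is read off.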

The information function $I(\epsilon)$ has a scaling region over a range of
$\log(1/\epsilon)$. The scaling region reflects the self-similarity of the pattern in the
MR. The information dimension $D_I$ can be found from the slope of $I(\epsilon)$ versus
$\log(1/\epsilon)$ in the scaling region. When the length $\epsilon$ of a zone in the MR
increases from $3^{-k}$ to $2^{-k}$, the MR of DNA sequences changes to the CGR. The
information dimension in the CGR can thus be determined as $(\log_2 3) \cdot D_I$. We compute
the information function $I(\epsilon)$ with different sizes $\epsilon$ for HUMHBB and plot it
in Fig. 2. A linear part of the curve $I(\epsilon)$ versus $\log(1/\epsilon)$ emerges between
$\log(1/\epsilon) = \log 3 = 1.10$ and $\log(1/\epsilon) = 6\log 3 = 6.59$. A fitting line is
also drawn in Fig. 2. The point at $\log(1/\epsilon) = 7\log 3 = 7.69$ starts to depart from
the line. As $\epsilon$ decreases further, the points depart farther and farther from the
line. Since points in the zones correspond to $k$-nucleotide strings, we conclude that the
self-similarity of the pattern in the MR is approximately preserved from mononucleotides
to 6-nucleotide strings, and that the suppression of many nucleotide strings emerges at
7-nucleotide strings. Using the least-squares fit method [?] for the linear part, we
determine its slope, i.e., the information dimension $D_I$, to be 1.20. It is less than
the information dimension 1.26 for a random sequence of the same length. Moreover, in
Fig. 3, we plot the information function $I(\epsilon)$ versus $\log(1/\epsilon)$ for YEAST1.
A linear part of the curve exists between $\log(1/\epsilon) = \log 3 = 1.10$ and
$\log(1/\epsilon) = 7\log 3 = 7.69$. We conclude that the suppression of many nucleotide
strings in YEAST1 emerges at 8-nucleotide strings. Using the least-squares fit method for
the linear part, we also plot a fitting line in Fig. 3 and determine its slope, i.e., the
information dimension $D_I$, to be 1.22. It is less than the information dimension 1.26
for a random sequence of the same length. The limits of self-similarity in the MR of
HUMHBB and YEAST1 coincide with the optimal string lengths for their genomic signatures.
Thus, for presenting the genomic signature, a self-similarity limit can be determined as
the optimal string length when computing the information dimension.
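The slope extraction can be sketched with an ordinary least-squares fit. The example below does not use the HUMHBB or YEAST1 data; it fits an idealised case in which all $4^k$ zones are equally occupied, so that $I = k \log 4$, as a sanity check on the reference value of about 1.26 and on the conversion factor $\log_2 3$ to the CGR. The function name `least_squares_slope` is hypothetical.

```python
import math


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Idealised scaling region: equal occupation of all 4**k zones for k = 1..6.
ks = range(1, 7)
xs = [k * math.log(3) for k in ks]  # log(1/eps) with eps = 3**-k
ys = [k * math.log(4) for k in ks]  # I(eps) = log(4**k)

d_mr = least_squares_slope(xs, ys)  # information dimension in the MR
d_cgr = math.log2(3) * d_mr         # converted to the CGR

if __name__ == "__main__":
    print(round(d_mr, 4), round(d_cgr, 4))
```

In this idealised case the MR slope is $\log 4 / \log 3 \approx 1.26$ and the CGR value is 2, the dimension of a uniformly filled square. For a real sequence, `xs` and `ys` would hold only the points inside the scaling region.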

Using the MR method, we determine the suppression of $k$-nucleotide strings in complete
bacterial genomes in Table II, where the genomes are listed in order of decreasing
suppression of $k$-nucleotide strings. For each of the complete bacterial genomes, a
linear part exists in the plot of the information function $I(\epsilon)$ versus
$\log(1/\epsilon)$. From the linear parts, we determine the self-similarity limits of the
genomic signatures in Table II. From this ordering, we find that the suppression in
complete bacterial genomes does not necessarily depend on the lengths of the sequences.
The common optimal string length for the complete bacterial genomic signatures can be
chosen as 7.

IV. CONCLUSION 

In summary, we have shown that the MR of DNA sequences is one-to-one. Using the MR
method, the suppression of nucleotide strings in DNA sequences is determined. For a DNA
sequence, an optimal string length for displaying the genomic signature is obtained by
eliminating the effects of finite sequence length. The optimal string length is further
shown to be a self-similarity limit in computing the information dimension. Using this
method, the self-similarity limits of complete bacterial genomic signatures are
determined.
FIGURES

Fig. 1 Metric representation of HUMHBB. Its boundary and partition lines are drawn as
solid lines and dashed lines, respectively.

Fig. 2 A plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$, shown by
dots, and its fitting line for HUMHBB.

Fig. 3 A plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$, shown by
dots, and its fitting line for YEAST1.
TABLES

Table I. Suppression of $k$-nucleotide strings in HUMHBB, YEAST1 and random sequences. The
total number of nucleotide strings of length $k$ and the number of suppressed
$k$-nucleotide strings are labeled by $\Pi_k$ and $\Lambda_k$, respectively.

k                                                         5      6         7            8              9             10
$\Pi_k$                                                1024   4096     16384        65536         262144        1048576
$\Lambda_k^{HUMHBB}$ / $\Lambda_k^{Random}$ (73308)     4/0  244/0  3667/208  32909/21402  209280/198219  985222/977852
$\Lambda_k^{YEAST1}$ / $\Lambda_k^{Random}$ (230209)    0/0    0/0     110/0    8897/2021  134302/109290  863555/842246

Table II. Suppression of $k$-nucleotide strings and self-similarity limits of complete
bacterial genomes, labeled by $\Lambda_k$ and $k_l$, respectively.